brier_score_loss (Brier score)#
The Brier score is a proper scoring rule for probabilistic binary classification.
It measures the mean squared error between predicted probabilities and the actual outcomes.
In this notebook you will:
understand the definition and its range
build intuition with plots
implement brier_score_loss from scratch in NumPy
use the Brier score as a differentiable training objective for a simple logistic regression model
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(0)
def sigmoid(z: np.ndarray) -> np.ndarray:
    # Numerically stable sigmoid: only ever exponentiate non-positive values.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out
def logit(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def calibrate_logit_scale(p: np.ndarray, k: float, eps: float = 1e-12) -> np.ndarray:
    # Monotone calibration distortion: p -> sigmoid(k * logit(p)).
    return sigmoid(k * logit(p, eps=eps))
def calibration_bins(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10):
    # Uniform-bin calibration curve + counts.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()
    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same shape")
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_id = np.digitize(y_prob, bins) - 1
    bin_id = np.clip(bin_id, 0, n_bins - 1)
    mean_pred = np.full(n_bins, np.nan)
    frac_pos = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = bin_id == b
        counts[b] = int(mask.sum())
        if counts[b] == 0:
            continue
        mean_pred[b] = float(y_prob[mask].mean())
        frac_pos[b] = float(y_true[mask].mean())
    return bins, mean_pred, frac_pos, counts
def brier_decomposition(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10):
    # Murphy (1973) style decomposition using bins as 'forecast categories'.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()
    bs = float(np.mean((y_prob - y_true) ** 2))
    _, mean_pred, frac_pos, counts = calibration_bins(y_true, y_prob, n_bins=n_bins)
    mask = counts > 0
    n = y_true.size
    w = counts[mask] / n
    rel = float(np.sum(w * (mean_pred[mask] - frac_pos[mask]) ** 2))
    y_bar = float(y_true.mean())
    res = float(np.sum(w * (frac_pos[mask] - y_bar) ** 2))
    unc = float(y_bar * (1.0 - y_bar))
    return bs, rel, res, unc
1) Definition#
For binary events we observe labels
\[y_i \in \{0, 1\}, \quad i = 1, \dots, n,\]
and we predict probabilities for the positive class
\[p_i = \hat{P}(y_i = 1) \in [0, 1].\]
The Brier score (loss) is
\[\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2.\]
Lower is better; 0 is perfect.
For binary classification with \(p_i \in [0,1]\), the score lies in \([0,1]\).
In scikit-learn, sklearn.metrics.brier_score_loss(y_true, y_prob, ...) expects y_prob = P(y=pos_label) (not hard class predictions).
Multiclass note#
sklearn.metrics.brier_score_loss is binary-only. A common multiclass generalization uses the probability vector \(p_i \in \mathbb{R}^K\) and a one-hot label \(e_{y_i}\):
\[\mathrm{BS}_{\mathrm{mc}} = \frac{1}{n} \sum_{i=1}^{n} \lVert p_i - e_{y_i} \rVert_2^2.\]
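As a hedged sketch of that generalization (the helper name `multiclass_brier` and the toy arrays are mine, not a scikit-learn API; note the summed-over-classes convention gives a range of \([0, 2]\)):

```python
import numpy as np

def multiclass_brier(y_true, y_prob):
    """Mean squared distance between probability rows and one-hot labels.

    y_true: (n,) integer class labels in {0, ..., K-1}
    y_prob: (n, K) rows of class probabilities
    With this summed-over-classes convention the score lies in [0, 2].
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    onehot = np.eye(y_prob.shape[1])[y_true]          # (n, K) one-hot labels
    return float(np.mean(np.sum((y_prob - onehot) ** 2, axis=1)))

# For K = 2 this reduces to twice the usual binary Brier score:
p = np.array([0.9, 0.2, 0.6])
y = np.array([1, 0, 1])
two_col = np.column_stack([1 - p, p])                 # [P(y=0), P(y=1)]
print(multiclass_brier(y, two_col))
print(2 * np.mean((p - y) ** 2))
```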
p = np.linspace(0.0, 1.0, 501)
fig = go.Figure()
fig.add_trace(go.Scatter(x=p, y=(p - 1.0) ** 2, mode="lines", name="y=1: (1-p)^2"))
fig.add_trace(go.Scatter(x=p, y=(p - 0.0) ** 2, mode="lines", name="y=0: p^2"))
fig.update_layout(
title="Per-example Brier loss as a function of predicted probability",
xaxis_title="Predicted probability p = P(y=1)",
yaxis_title="Loss (p - y)^2",
template="plotly_white",
)
fig.show()
2) Proper scoring rule (why “probabilities” matter)#
Assume the true event probability is \(q\) and
\[Y \sim \mathrm{Bernoulli}(q).\]
If we always predict the same probability \(p\), the expected Brier score is:
\[\mathbb{E}\big[(p - Y)^2\big] = (p - q)^2 + q(1 - q).\]
The second term \(q(1-q)\) is the irreducible uncertainty in the labels.
The first term \((p-q)^2\) is the penalty for miscalibration.
So the expected Brier score is minimized at:
\[p = q.\]
That’s the defining property of a (strictly) proper scoring rule: telling the truth (predicting the real probability) is optimal.
q = 0.3 # true event probability
n = 50_000
y = rng.binomial(1, q, size=n).astype(float)
p_grid = np.linspace(0.0, 1.0, 401)
empirical = np.mean((p_grid[None, :] - y[:, None]) ** 2, axis=0)
analytic = (p_grid - q) ** 2 + q * (1.0 - q)
fig = go.Figure()
fig.add_trace(go.Scatter(x=p_grid, y=empirical, mode="lines", name="empirical E[(p-Y)^2]"))
fig.add_trace(
go.Scatter(
x=p_grid,
y=analytic,
mode="lines",
name="analytic (p-q)^2 + q(1-q)",
line=dict(dash="dash"),
)
)
fig.add_vline(x=q, line=dict(color="black", dash="dot"))
fig.update_layout(
title=f"Expected Brier score is minimized at p = q (here q={q})",
xaxis_title="Forecast p",
yaxis_title="Expected Brier score",
template="plotly_white",
)
fig.show()
3) Best constant predictor + Brier skill score#
If you predict the same probability \(p\) for every sample, the Brier objective is
\[\mathrm{BS}(p) = \frac{1}{n} \sum_{i=1}^{n} (p - y_i)^2.\]
Taking the derivative and setting it to zero gives:
\[\frac{d\,\mathrm{BS}}{dp} = \frac{2}{n} \sum_{i=1}^{n} (p - y_i) = 0 \quad\Longrightarrow\quad p^* = \bar{y}.\]
So the best constant forecast is simply the base rate.
A common normalization is the Brier Skill Score (BSS) against a reference forecast (often the base rate):
\[\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}}.\]
BSS = 1 is perfect
BSS = 0 matches the reference
BSS < 0 is worse than the reference
y = rng.binomial(1, 0.2, size=3000).astype(float)
p_star = y.mean()
p_grid = np.linspace(0.0, 1.0, 201)
losses = np.array([np.mean((p - y) ** 2) for p in p_grid])
fig = go.Figure()
fig.add_trace(go.Scatter(x=p_grid, y=losses, mode="lines", name="Brier(p)"))
fig.add_vline(
x=p_star,
line=dict(color="black", dash="dot"),
annotation_text=f"mean(y) = {p_star:.3f}",
)
fig.update_layout(
title="Best constant probability equals the base rate",
xaxis_title="Constant prediction p",
yaxis_title="Brier score",
template="plotly_white",
)
fig.show()
print(f"Best constant p*: {p_star:.4f}")
print(
    f"Brier score at p*: {np.mean((p_star - y) ** 2):.4f} "
    f"(≈ p*(1-p*) = {p_star*(1-p_star):.4f})"
)
Best constant p*: 0.2087
Brier score at p*: 0.1651 (≈ p*(1-p*) = 0.1651)
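The BSS definition above can be sketched as a small helper against the base-rate reference (the function name `brier_skill_score` is mine, not a scikit-learn API):

```python
import numpy as np

def brier_skill_score(y_true, y_prob, p_ref=None):
    """BSS = 1 - BS / BS_ref, with the constant base-rate forecast
    p_ref = mean(y_true) as the default reference."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    if p_ref is None:
        p_ref = y_true.mean()
    bs = np.mean((y_prob - y_true) ** 2)
    bs_ref = np.mean((p_ref - y_true) ** 2)
    return float(1.0 - bs / bs_ref)

y = np.array([0, 0, 0, 1, 1])
print(brier_skill_score(y, y.astype(float)))       # perfect forecast -> 1.0
print(brier_skill_score(y, np.full(5, y.mean())))  # base-rate forecast -> 0.0
```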
4) Calibration intuition (and the Murphy decomposition)#
Accuracy only uses a hard decision (e.g. p >= 0.5).
The Brier score uses the full probability values, so it can distinguish between:
a well-calibrated forecast: “when I say 0.8, it happens ~80% of the time”
an overconfident forecast: probabilities pushed too close to 0 or 1
an underconfident forecast: probabilities too close to 0.5
For binary forecasts, the Brier score can be decomposed into three interpretable parts (Murphy, 1973):
\[\mathrm{BS} = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC}.\]
Using probability bins (or exact forecast categories), with:
\(p_k\) = average predicted probability in bin \(k\)
\(o_k\) = observed frequency in bin \(k\)
\(n_k\) = number of samples in bin \(k\)
\(\bar{y}\) = base rate
the components are:
\[\mathrm{REL} = \sum_k \frac{n_k}{n} (p_k - o_k)^2, \qquad \mathrm{RES} = \sum_k \frac{n_k}{n} (o_k - \bar{y})^2, \qquad \mathrm{UNC} = \bar{y}(1 - \bar{y}).\]
Intuition:
Reliability is (binned) calibration error (smaller is better)
Resolution rewards separating cases with different true frequencies (bigger is better)
Uncertainty is a property of the dataset only
n = 10_000
x = rng.normal(size=n)
# A "true" probability model
p_true = sigmoid(1.5 * x - 0.25)
y = rng.binomial(1, p_true)
# Same hard predictions (threshold at 0.5), but different calibration
p_calibrated = p_true
p_overconf = calibrate_logit_scale(p_true, k=2.0)
p_underconf = calibrate_logit_scale(p_true, k=0.5)
models = {
"calibrated (truth)": p_calibrated,
"overconfident (k=2)": p_overconf,
"underconfident (k=0.5)": p_underconf,
}
print("Model comparison (same threshold decisions, different probability quality)")
for name, p in models.items():
    y_hat = (p >= 0.5).astype(int)
    acc = accuracy_score(y, y_hat)
    bs = brier_score_loss(y, p)
    _, rel, res, unc = brier_decomposition(y, p, n_bins=10)
    approx = rel - res + unc
    print(
        f"{name:22s} accuracy={acc:.3f} brier={bs:.4f} "
        f"rel={rel:.4f} res={res:.4f} unc={unc:.4f} (rel-res+unc≈{approx:.4f})"
    )
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Reliability diagram (calibration curve)", "Predicted probability distribution"),
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
name="perfect calibration",
line=dict(color="black", dash="dash"),
),
row=1,
col=1,
)
for name, p in models.items():
    _, mean_pred, frac_pos, counts = calibration_bins(y, p, n_bins=10)
    mask = counts > 0
    fig.add_trace(
        go.Scatter(x=mean_pred[mask], y=frac_pos[mask], mode="markers+lines", name=name),
        row=1,
        col=1,
    )
    fig.add_trace(go.Histogram(x=p, name=name, opacity=0.45, nbinsx=30), row=1, col=2)
fig.update_xaxes(title_text="Mean predicted probability", range=[0, 1], row=1, col=1)
fig.update_yaxes(title_text="Observed frequency", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="Predicted probability", range=[0, 1], row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)
fig.update_layout(
barmode="overlay",
template="plotly_white",
title="Calibration affects Brier score even when accuracy is unchanged",
)
fig.show()
Model comparison (same threshold decisions, different probability quality)
calibrated (truth) accuracy=0.734 brier=0.1763 rel=0.0001 res=0.0703 unc=0.2477 (rel-res+unc≈0.1775)
overconfident (k=2) accuracy=0.734 brier=0.1879 rel=0.0110 res=0.0691 unc=0.2477 (rel-res+unc≈0.1896)
underconfident (k=0.5) accuracy=0.734 brier=0.1880 rel=0.0110 res=0.0687 unc=0.2477 (rel-res+unc≈0.1900)
5) From-scratch NumPy implementation#
For binary labels, the implementation is just mean squared error:
\[\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2.\]
Two practical details (matching sklearn.metrics.brier_score_loss):
y_prob should be a probability for the positive class.
If your labels are not {0,1}, you must specify which value is positive (pos_label).
def brier_score_loss_np(y_true, y_prob, sample_weight=None, pos_label=None) -> float:
    y_true = np.asarray(y_true).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()
    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same shape")
    if pos_label is None:
        unique = np.unique(y_true)
        if np.all(np.isin(unique, [0, 1])):
            pos_label = 1
        else:
            raise ValueError("pos_label must be specified when y_true is not in {0,1}")
    y01 = (y_true == pos_label).astype(float)
    y_prob = np.clip(y_prob, 0.0, 1.0)
    sq = (y_prob - y01) ** 2
    if sample_weight is None:
        return float(np.mean(sq))
    w = np.asarray(sample_weight, dtype=float).ravel()
    if w.shape != y_true.shape:
        raise ValueError("sample_weight must have the same shape as y_true")
    if np.any(w < 0):
        raise ValueError("sample_weight cannot contain negative weights")
    return float(np.average(sq, weights=w))
# Quick sanity checks vs scikit-learn
y = rng.integers(0, 2, size=1000)
p = rng.random(1000)
print("sklearn:", brier_score_loss(y, p))
print("numpy :", brier_score_loss_np(y, p))
print("allclose:", np.allclose(brier_score_loss(y, p), brier_score_loss_np(y, p)))
# pos_label example
y_pm = rng.choice([-1, 1], size=1000)
p = rng.random(1000)
print("pos_label=1 (sklearn):", brier_score_loss(y_pm, p, pos_label=1))
print("pos_label=1 (numpy) :", brier_score_loss_np(y_pm, p, pos_label=1))
# sample_weight example
w = rng.random(1000)
print("weighted (sklearn):", brier_score_loss(y, p, sample_weight=w))
print("weighted (numpy) :", brier_score_loss_np(y, p, sample_weight=w))
sklearn: 0.3392506425102862
numpy : 0.3392506425102862
allclose: True
pos_label=1 (sklearn): 0.3285352441601213
pos_label=1 (numpy) : 0.3285352441601213
weighted (sklearn): 0.3431820394131602
weighted (numpy) : 0.3431820394131602
6) Brier vs log loss (cross-entropy)#
Both Brier score and log loss are proper scoring rules, but they penalize mistakes differently.
For a single example:
Brier: \((p-y)^2\) (bounded in \([0,1]\))
Log loss: \(-y\log p -(1-y)\log(1-p)\) (unbounded for confident wrong predictions)
Log loss heavily punishes being confidently wrong (e.g. predicting \(p \approx 0\) when \(y=1\)), while the Brier score is more forgiving.
p = np.linspace(1e-3, 1.0 - 1e-3, 500)
brier_y1 = (1.0 - p) ** 2
brier_y0 = p**2
logloss_y1 = -np.log(p)
logloss_y0 = -np.log(1.0 - p)
fig = make_subplots(rows=1, cols=2, subplot_titles=("y=1", "y=0"))
fig.add_trace(go.Scatter(x=p, y=brier_y1, mode="lines", name="Brier"), row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=logloss_y1, mode="lines", name="Log loss"), row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=brier_y0, mode="lines", name="Brier"), row=1, col=2)
fig.add_trace(go.Scatter(x=p, y=logloss_y0, mode="lines", name="Log loss"), row=1, col=2)
fig.update_xaxes(title_text="Predicted probability p = P(y=1)", row=1, col=1)
fig.update_xaxes(title_text="Predicted probability p = P(y=1)", row=1, col=2)
fig.update_yaxes(title_text="loss", row=1, col=1)
fig.update_yaxes(title_text="loss", row=1, col=2)
fig.update_layout(title="Per-example loss: Brier vs log loss", template="plotly_white")
fig.show()
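A quick numeric illustration of this asymmetry for a single positive example:

```python
import numpy as np

# A single positive example (y = 1) with increasingly confident wrong forecasts.
for p in [0.1, 0.01, 0.001]:
    brier = (p - 1.0) ** 2   # bounded above by 1
    logloss = -np.log(p)     # grows without bound as p -> 0
    print(f"p={p:<6} brier={brier:.3f} logloss={logloss:.2f}")
```

The Brier penalty saturates near 1, while the log-loss penalty keeps growing as the wrong forecast becomes more confident.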
7) Optimizing a simple model with the Brier loss (logistic regression)#
Because the Brier score is differentiable in the predicted probabilities, you can use it as a training objective.
Let a logistic model predict
\[p_i = \sigma(w^\top x_i + b),\]
where \(\sigma(t) = 1/(1+e^{-t})\).
The Brier objective is:
\[L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2.\]
Using the chain rule, with \(z_i = w^\top x_i + b\):
\[\frac{\partial L}{\partial z_i} = \frac{2}{n}\,(p_i - y_i)\,p_i (1 - p_i).\]
So the gradients are:
\[\nabla_w L = \sum_{i=1}^{n} \frac{\partial L}{\partial z_i}\, x_i, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{n} \frac{\partial L}{\partial z_i}.\]
A key difference vs log loss: the extra factor \(p(1-p) \le 0.25\) makes gradients smaller near 0/1, which can slow learning on confidently wrong examples.
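A small sketch of that gradient damping (the plain `sig` helper is a local stand-in for the notebook's `sigmoid`):

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-example gradient w.r.t. the logit z for a positive example (y = 1):
#   Brier:    dL/dz = 2 (p - y) p (1 - p)
#   log loss: dL/dz = p - y
for z in [-2.0, -4.0, -7.0]:  # increasingly confident *wrong* predictions
    p = sig(z)
    g_brier = 2.0 * (p - 1.0) * p * (1.0 - p)
    g_log = p - 1.0
    print(f"p={p:.4f}  Brier grad={g_brier:+.5f}  log-loss grad={g_log:+.4f}")
```

As the model becomes confidently wrong, the log-loss gradient stays near -1 while the Brier gradient shrinks toward 0.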
# Synthetic 2D dataset so we can visualize the learned probabilities
X, y = make_classification(
n_samples=2000,
n_features=2,
n_redundant=0,
n_informative=2,
n_clusters_per_class=1,
class_sep=1.5,
flip_y=0.03,
random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=0, stratify=y
)
# Standardize (important for gradient descent)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma
def train_logreg_brier_gd(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 1.0,
    n_iter: int = 1000,
    l2: float = 1e-3,
    seed: int = 0,
):
    rng_local = np.random.default_rng(seed)
    n, d = X.shape
    w = rng_local.normal(scale=0.01, size=d)
    b = 0.0
    history = {"brier": []}
    for _ in range(n_iter):
        z = X @ w + b
        p = sigmoid(z)
        loss = np.mean((p - y) ** 2) + 0.5 * l2 * np.sum(w**2)
        history["brier"].append(float(loss))
        dL_dz = (2.0 / n) * (p - y) * p * (1.0 - p)
        grad_w = X.T @ dL_dz + l2 * w
        grad_b = float(dL_dz.sum())
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history
w, b, history = train_logreg_brier_gd(X_train_s, y_train, lr=1.0, n_iter=1000, l2=1e-3, seed=0)
p_test = sigmoid(X_test_s @ w + b)
print("From-scratch logistic regression (trained with Brier loss)")
print(f"test accuracy : {accuracy_score(y_test, (p_test >= 0.5).astype(int)):.3f}")
print(f"test brier (numpy) : {brier_score_loss_np(y_test, p_test):.4f}")
print(f"test brier (sklearn): {brier_score_loss(y_test, p_test):.4f}")
fig = go.Figure()
fig.add_trace(go.Scatter(y=history["brier"], mode="lines", name="train Brier loss"))
fig.update_layout(
title="Gradient descent minimizing Brier loss",
xaxis_title="Iteration",
yaxis_title="Brier loss (+ L2 if set)",
template="plotly_white",
)
fig.show()
# Visualize the learned probability surface (test set)
x0_min, x0_max = X_test_s[:, 0].min() - 0.75, X_test_s[:, 0].max() + 0.75
x1_min, x1_max = X_test_s[:, 1].min() - 0.75, X_test_s[:, 1].max() + 0.75
grid_x0 = np.linspace(x0_min, x0_max, 220)
grid_x1 = np.linspace(x1_min, x1_max, 220)
xx, yy = np.meshgrid(grid_x0, grid_x1)
grid = np.c_[xx.ravel(), yy.ravel()]
proba = sigmoid(grid @ w + b).reshape(xx.shape)
fig = go.Figure()
fig.add_trace(
go.Contour(
x=grid_x0,
y=grid_x1,
z=proba,
contours=dict(start=0.0, end=1.0, size=0.1),
colorscale="RdBu",
opacity=0.75,
colorbar=dict(title="P(y=1)"),
)
)
fig.add_trace(
go.Contour(
x=grid_x0,
y=grid_x1,
z=proba,
contours=dict(start=0.5, end=0.5, coloring="lines"),
line=dict(color="black", width=2),
showscale=False,
name="p=0.5",
)
)
fig.add_trace(
go.Scatter(
x=X_test_s[:, 0],
y=X_test_s[:, 1],
mode="markers",
marker=dict(
color=y_test,
colorscale="Portland",
size=6,
line=dict(width=0.5, color="white"),
),
name="test points",
)
)
fig.update_layout(
title="Learned probability surface and decision boundary (test set)",
xaxis_title="feature 1 (standardized)",
yaxis_title="feature 2 (standardized)",
template="plotly_white",
)
fig.show()
# Calibration curve on the test set
_, mean_pred, frac_pos, counts = calibration_bins(y_test, p_test, n_bins=10)
mask = counts > 0
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
name="perfect calibration",
line=dict(color="black", dash="dash"),
)
)
fig.add_trace(go.Scatter(x=mean_pred[mask], y=frac_pos[mask], mode="markers+lines", name="model"))
fig.update_layout(
title="Reliability diagram (test set)",
xaxis_title="Mean predicted probability",
yaxis_title="Observed frequency",
xaxis=dict(range=[0, 1]),
yaxis=dict(range=[0, 1]),
template="plotly_white",
)
fig.show()
# Practical usage: the same metric with a standard scikit-learn classifier
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train_s, y_train)
p_sklearn = clf.predict_proba(X_test_s)[:, 1]
print("\nscikit-learn LogisticRegression (log loss optimized)")
print(f"test accuracy: {accuracy_score(y_test, (p_sklearn >= 0.5).astype(int)):.3f}")
print(f"test brier : {brier_score_loss(y_test, p_sklearn):.4f}")
From-scratch logistic regression (trained with Brier loss)
test accuracy : 0.980
test brier (numpy) : 0.0208
test brier (sklearn): 0.0208
scikit-learn LogisticRegression (log loss optimized)
test accuracy: 0.980
test brier : 0.0200
8) Pros, cons, and when to use it#
Pros
Proper scoring rule for probabilistic forecasts (encourages honest probabilities)
Interpretable as “MSE on probabilities”; bounded for binary problems
Sensitive to calibration; supports the reliability/resolution/uncertainty decomposition
Common in risk prediction, weather forecasting, calibration evaluation
Cons
Less punishing for confident wrong predictions than log loss (can matter for training)
Value depends on the base rate; comparing scores across datasets can be misleading
Binary-only in sklearn.metrics.brier_score_loss (multiclass requires an extension)
On rare events, raw Brier numbers can look deceptively small; use a skill score or compare to a baseline
9) Pitfalls, diagnostics, exercises#
Pitfalls
Passing class labels instead of probabilities (use predict_proba for probabilistic classifiers)
Mis-specified positive class (pos_label) when labels are not {0,1}
Interpreting Brier without a baseline on imbalanced data
Diagnostics
Plot a reliability diagram (calibration curve)
Report a baseline Brier score (base rate) and optionally Brier Skill Score
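For the reliability diagram, scikit-learn also ships a ready-made helper, `sklearn.calibration.calibration_curve`; a minimal sketch (the synthetic, perfectly calibrated forecasts are an assumption for illustration):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.random(5000)
y_true = rng.binomial(1, y_prob)  # labels drawn from the forecasts -> well calibrated

# Binned calibration curve: observed frequency vs mean predicted probability.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
baseline = np.mean((y_true.mean() - y_true) ** 2)  # base-rate Brier score
print("baseline Brier:", round(baseline, 4))
print("max |calibration gap|:", round(float(np.max(np.abs(prob_true - prob_pred))), 3))
```

With well-calibrated forecasts the curve hugs the diagonal, so the per-bin gaps stay small; the base-rate Brier score gives the reference any real model should beat.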
Exercises
Implement the multiclass Brier score and test it on a 3-class toy problem.
Compute Brier Skill Score using the base rate as reference.
Compare Brier vs log loss for a deliberately overconfident model.
References#
scikit-learn docs: https://scikit-learn.org/stable/modules/model_evaluation.html#brier-score-loss
Glenn W. Brier (1950), Verification of forecasts expressed in terms of probability
Allan H. Murphy (1973), A new vector partition of the probability score